Data Visualization Project 2: Analysis of the World Happiness dataset

Author

Caroline Graebel, Nina Immenroth, Bogdan Kostić, Naim Zahari

1 Background

1.1 About the World Happiness Report

The World Happiness Report (WHR) presents a worldwide evaluation of happiness and well-being across 170 countries (World Happiness Report). The WHR measures the state of happiness annually and emphasizes the need to incorporate well-being factors into government policy making.

The report is a collaboration of Gallup, the Oxford Wellbeing Research Centre, the United Nations Sustainable Development Solutions Network and the WHR’s Editorial Board. The first report was released in 2012.

1.1.1 Understanding the Happiness Score

Happiness scores are based on survey responses from the Gallup World Poll (World Happiness Report). The evaluation primarily stems from the answers to the survey questions and the Cantril Ladder score. The poll comprises more than 100 questions, including global questions and region-specific topics, covering several well-being areas: law and order, good jobs, well-being, food and shelter, institutions and infrastructure, and brain gain (Wikipedia 2024). Gallup typically surveys a sample of 1,000 people per country each year.

The Cantril Ladder score ranges from 0 to 10, where 0 is the worst possible well-being condition and 10 is the best. Respondents are asked to rate the current state of their well-being on this scale. The country happiness rankings are based on three-year samples of the poll. The ladder score is the overall national happiness score. The term “life evaluation” is sometimes used interchangeably with happiness score.
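As a rough illustration of how a three-year ladder score could lead to a ranking, the following Python sketch averages yearly national ladder scores and sorts the countries. The country names and values are hypothetical, and a simple unweighted mean of yearly averages is an assumption made here; the report itself may pool or weight the samples differently.

```python
from statistics import mean

# Hypothetical yearly national ladder averages (0-10 scale); not real WHR data.
ladder_by_year = {
    "Country A": {2021: 7.4, 2022: 7.6, 2023: 7.5},
    "Country B": {2021: 5.1, 2022: 4.9, 2023: 5.3},
    "Country C": {2021: 6.2, 2022: 6.0, 2023: 6.1},
}

def three_year_score(yearly_scores):
    """Unweighted mean of one country's yearly ladder scores (assumption)."""
    return mean(yearly_scores.values())

scores = {country: three_year_score(years)
          for country, years in ladder_by_year.items()}

# Rank countries from the highest to the lowest three-year score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```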

1.2 Six Explanatory Variables

Figure 1, which reproduces Figure 2.1 of the World Happiness Report 2024, shows the happiness score (life evaluation) for each country in the dataset.

Figure 1: WHR Figure 2.1: Country Rankings by Life Evaluations in 2021-2023

Six explanatory variables are derived from the poll answers, and their estimated associations with the happiness score are used to explain the variation across countries. These variables are Gross Domestic Product (GDP) per capita, healthy life expectancy, social support, freedom, corruption and generosity. Their definitions are given in the Data section of this report.

According to the report, the sub-bars do not influence the total score reported for each country (World Happiness Report 2024); they merely indicate how much of the country’s overall score is explained by each of the six explanatory variables (World Happiness Report). Each sub-bar is calculated by multiplying the national average of a variable over the last three years (minus its value in Dystopia) by the coefficient estimated for that variable. For the 2024 report, the national averages for 2021-2023 are used for the sub-bars of each variable.
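The sub-bar calculation can be sketched as follows, under the reading that each sub-bar is the estimated coefficient of a variable multiplied by the gap between the national three-year average and the Dystopia value. The function name `sub_bar` and all numbers are hypothetical and chosen only to make the arithmetic visible; they are not taken from the report.

```python
def sub_bar(national_average, dystopia_value, coefficient):
    """Width of one sub-bar: coefficient times the gap to Dystopia."""
    return coefficient * (national_average - dystopia_value)

# Hypothetical values for illustration only (not taken from the report).
social_support_avg = 0.90       # national average for 2021-2023
social_support_dystopia = 0.40  # lowest national average observed
social_support_coef = 2.0       # assumed regression coefficient

width = sub_bar(social_support_avg, social_support_dystopia, social_support_coef)
print(round(width, 3))

# A country at the Dystopia value gets a sub-bar of width zero.
zero_width = sub_bar(social_support_dystopia, social_support_dystopia,
                     social_support_coef)
```

Because Dystopia takes the lowest observed value of every variable, the gap is never negative, which is why each sub-bar has positive or zero width.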

2 Software and Tools

For the analysis, the RStudio Desktop environment was used. The following packages were used for data manipulation, visualization, and analysis.

  • here

  • tidyverse

  • GGally

  • grid

  • randomForest

  • corrplot

Functions from these packages were used to conduct the analysis, with the following functions being particularly important:

  • randomForest

    randomForest implements Breiman’s random forest algorithm for classification and regression. Each tree is trained on a bootstrap sample of the data, and a random subset of predictors is considered at each split. The fitted model can then be used to determine which features are important.

3 Data

3.1 World Happiness Report Data 2015-2023

The dataset was obtained from the Kaggle datasets collection and includes the World Happiness Report data from 2015 to 2023. It contains the country name, region name, national happiness score, and the explanatory factors (Islam 2023): GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption. The variables represent the socio-economic and well-being status of the individuals who participated in the Gallup World Poll during that period.

3.2 World Happiness Report 2024 Chapter 2 Appendix

The dataset for the Chapter 2 Appendix comes from the survey responses of the Gallup World Poll from 2005 to 2023. It contains geographical information, namely country name and region name, and numerical information for GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity and perceptions of corruption. In contrast to the previous dataset, the GDP per capita and healthy life expectancy columns contain the actual national averages and are not adjusted for the life evaluation score.

3.3 Variable Definitions

Happiness score is a measure of subjective well-being (SWB) in the Gallup World Poll (GWP) covering years from 2005/06 to 2023. Unless stated otherwise, it is the national average response to the question of life evaluations. The question in the poll that directly measures the score is “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”

This measure is also referred to as the Cantril life ladder.

GDP per capita is measured in purchasing power parity (PPP) terms at constant 2017 international dollar prices and is taken from the World Development Indicators. When the latest GDP per capita is not available as of a certain date of the year, country-specific forecasts of real GDP growth from the OECD Economic Outlook are used for the average calculation. If they are unavailable from the Economic Outlook, forecasts from the World Bank’s Global Economic Prospects are used.

Healthy life expectancy (HLE) at birth is based on the Global Health Observatory data of the World Health Organization (WHO).

Social support is the national average of the binary responses (either 0 or 1) to the Gallup World Poll question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”. It represents the ability to rely on someone in times of trouble.

Freedom to make life choices is the national average of responses to the Gallup World Poll question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”.

Generosity is the national average of answers to the Gallup World Poll question “Have you donated money to a charity in the past month?”.

Perceptions of corruption is the national average of the responses to two questions in the Gallup World Poll: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?”.

Positive affect is defined as the average of three positive affect measures in the GWP: laughter, enjoyment and doing interesting things.

These measures are the responses to the following three questions, respectively:

  • “Did you smile or laugh a lot yesterday?”

  • “Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Enjoyment?”

  • “Did you learn or do something interesting yesterday?”

Negative affect is defined as the average of three negative affect measures in the GWP: worry, sadness and anger.

These measures are the responses to the following three questions, respectively:

  • “Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Worry?”

  • “Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Sadness?”

  • “Did you experience the following feelings during A LOT OF THE DAY yesterday? How about Anger?”
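The affect measures can be illustrated with a small Python sketch. It assumes that each question is answered with yes (1) or no (0), that the national value of an item is the share of yes answers, and that the affect score is the plain average of the three item values. The responses below are hypothetical.

```python
from statistics import mean

# Hypothetical yes (1) / no (0) answers of five respondents in one country.
responses = {
    "worry":   [1, 0, 0, 1, 0],
    "sadness": [0, 0, 1, 0, 0],
    "anger":   [0, 0, 0, 1, 0],
}

def affect_score(item_responses):
    """Average the national share of 'yes' answers over all items."""
    item_means = [mean(answers) for answers in item_responses.values()]
    return mean(item_means)

negative_affect = affect_score(responses)
print(round(negative_affect, 3))
```

The same function would apply to positive affect by passing the laughter, enjoyment and interest items instead.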

Dystopia is an imaginary country that has the world’s least-happy people. The purpose in establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables, thus allowing each sub-bar to be of positive (or zero, in six instances) width. The lowest scores observed for the six key variables, therefore, characterize Dystopia.

4 Top-level analysis of the data

To explore interesting structures in the data, we first plot the main variable “happiness_score” against the other variables to get a first feeling for the data. Afterwards, a K-means model is used to look into a possible clustering of the data and how the resulting clusters can be interpreted. Lastly, we look into how a random forest regression model rates the importance of the variables when predicting the happiness score.

4.1 Looking at distribution of the happiness score

Figure 2: Distribution graph of the happiness score.

Looking at our variable of interest, the distribution appears roughly bell-shaped and is quite close to a normal distribution.

4.2 Looking into relationship of the happiness score to predictors

4.2.1 Region

Figure 3: Boxplot Happiness Score based on different regions.

When looking at how the happiness score relates to the regions of the countries in the data, we can see that the regions differ considerably. Western Europe and North America and ANZ have the highest median happiness scores. In contrast, Sub-Saharan Africa and South Asia have the lowest medians. The regions also differ strongly in their spread. The interquartile range (the distance between the first and third quartiles) is larger in the Middle East and North Africa, while for Africa and North America the boxes are very small.
Region serves as an aggregation of the variable country, since the number of countries covered is too large to visualize individually. Countries that show interesting patterns are examined later in the report.
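To make the interquartile range concrete, the following Python sketch computes it per region. The region names and scores are hypothetical, and the quartile convention used (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption; boxplot implementations may use slightly different quartile rules.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical happiness scores for two regions (not real data).
scores_by_region = {
    "Region X": [4.0, 4.5, 5.0, 6.0, 7.0],
    "Region Y": [6.9, 7.0, 7.1, 7.2, 7.3],
}

spread = {region: iqr(scores) for region, scores in scores_by_region.items()}
print(spread)
```

A larger value corresponds to a taller box in the boxplot, so Region X would show a much wider box than Region Y.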

(a) Happiness Score over GDP per Capita
(b) Happiness Score over Social Support
(c) Happiness Score over Healthy Life Expectancy
(d) Happiness Score over Freedom to make Life Choices
(e) Happiness Score over Generosity
(f) Happiness Score over Perceptions of Corruption

Figure 4: Scatterplots for relationship of happiness score and different predictor variables.

Figure 5: Correlation matrix plot.

4.2.2 GDP per Capita

There seems to be a positive linear relationship between GDP per capita and the happiness score: the happiness score rises with higher GDP. The relationship is stronger than it appears in the plot because the scales of the x- and y-axes differ considerably. Calculating the correlation provides further evidence for the linear relationship between the variables.

Correlation of the happiness score and GDP per capita: 0.7238105

The correlation of these variables, which can also be seen in Figure 5, is high with a value of 0.72.
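The correlation values reported in this section are Pearson correlation coefficients. As a minimal sketch of the calculation, written in Python rather than the R code used for the analysis, the coefficient is the covariance of two variables divided by the product of their standard deviations. The data below are hypothetical and do not reproduce the reported values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # The 1/(n-1) factors of covariance and variances cancel out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, not taken from the dataset.
gdp = [0.5, 0.9, 1.2, 1.5, 1.8]
happiness = [4.2, 5.0, 5.1, 6.3, 6.9]

r = pearson(gdp, happiness)
print(round(r, 2))
```

Values close to 1 indicate a strong positive linear relationship, values close to 0 indicate no linear relationship, and values close to -1 indicate a strong negative one.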

4.2.3 Social Support

As with GDP per capita, there is a positive linear relationship between social support and the happiness score.

Correlation of the happiness score and social support: 0.6481553

The correlation between social support and the happiness score is not as strong as for GDP per capita, but it is still strong.

4.2.4 Healthy life expectancy

Here, we again have a positive linear relationship between the two variables. Interestingly, for healthy life expectancy some values are missing.

Correlation of the happiness score and healthy life expectancy: 0.6823998

There is a strong positive correlation between healthy life expectancy and the happiness score.

4.2.5 Freedom to make life choices

In this case, there is also a positive linear relationship between freedom to make life choices and the happiness score, even though it seems a bit weaker than for the variables discussed so far.

Correlation of the happiness score and freedom to make life choices: 0.5694581

There is a moderately strong positive correlation between the variables, but it is the weakest so far.

4.2.6 Generosity

No relationship between generosity and the happiness score is visible in the scatterplot.

4.2.7 Perceptions of corruption

For perceptions of corruption, there seems to be a slight upward trend in the happiness score for higher values, but only a fraction of the data points score high on perceptions of corruption. There is also missing data for this variable.

Correlation of the happiness score and Perceptions of corruption: 0.4150709

The correlation confirms a positive relationship between the two variables, although it is weak compared to the other variables examined.

4.2.8 Year

Figure 6: Boxplot Happiness Score over Years.

The medians show a slight upward trend over the years, with 2020 to 2022 showing little variance compared to the other years. There are also low outliers for 2021 to 2023.

4.3 Using K-means Clustering to find interesting patterns

The goal of using K-means clustering is to find an interpretable pattern that gives further context to the relationship between the happiness score and the other variables. The procedure is introduced first; then the data is scaled so that all variables contribute equally to the distances, and the number of clusters is chosen using the elbow criterion on the total within-cluster sum of squares. In other words, the goal is to keep the overall distance between points within each cluster small without using more clusters than necessary. At the end, a final model is trained and the result is plotted.
Note that K-means is only suitable for continuous variables, so country, region, and year are not considered here.
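The scaling step can be sketched as a standardization of each variable to mean 0 and standard deviation 1. The sketch below is written in Python and assumes the sample standard deviation is used; the values are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Centre to mean 0 and scale to sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

happiness = [4.2, 5.0, 5.1, 6.3, 6.9]  # hypothetical values
scaled = standardize(happiness)
print([round(v, 2) for v in scaled])
```

After this step, a value of 0 corresponds to the mean of the original variable, which is why scaled variables can take negative values.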

4.3.1 Introduction to K-Means

K-means clustering is an unsupervised machine learning method that groups data that has no label a model could be trained on. It uses a distance-based approach that fits the clusters so that the distances between the data points within a cluster are minimal. Distance here measures how dissimilar two data points are, so the clusters are fitted such that the points within one cluster are as similar as possible.
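The idea can be sketched with Lloyd's algorithm, the textbook variant of K-means, which alternates between assigning points to their nearest center and moving each center to the mean of its points. This Python sketch is an illustration only: the function names, the toy points and the choice of initial centers are assumptions, and the R implementation used for the analysis may rely on a different variant of the algorithm.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centers, max_iter=100):
    """Lloyd's algorithm, started from the given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(len(centers)),
                      key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its points.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:  # converged: nothing moved
            break
        centers = new_centers
    return labels, centers

# Hypothetical scaled data with two visible groups.
points = [(-1.2, -0.9), (-0.8, -1.1), (-1.0, -1.0),
          (0.9, 1.1), (1.1, 0.8), (1.0, 1.1)]
labels, centers = k_means(points, centers=[points[0], points[3]])
print(labels)  # -> [0, 0, 0, 1, 1, 1]
```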

4.3.2 Trying PCA for K-Means performance optimization

Applying PCA to the data before clustering can improve the performance of K-means, as it condenses the information into fewer dimensions. However, PCA is only useful if around 80% of the variance can be covered with only a few principal components.

Figure 7: Cumulative Proportion of the Variance explained by each Principal Component.

As can be seen, four to five of the seven principal components would be needed to cover a sufficient share of the variance. Since this offers little reduction and each principal component contains a similar amount of variance, no PCA is applied before K-means.
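The 80% rule of thumb can be expressed as a short calculation on the variances of the principal components. In the Python sketch below, the component variances are hypothetical placeholders for seven scaled variables, and `components_needed` is a helper name introduced only for this illustration.

```python
from itertools import accumulate

def components_needed(variances, threshold=0.80):
    """Smallest number of leading components reaching the threshold."""
    total = sum(variances)
    cumulative = [c / total for c in accumulate(variances)]
    for count, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return count, cumulative
    return len(variances), cumulative

# Hypothetical component variances for seven scaled variables.
variances = [2.1, 1.4, 1.2, 1.0, 0.6, 0.4, 0.3]
count, cumulative = components_needed(variances)
print(count, [round(share, 2) for share in cumulative])
```

If most of the components are needed to reach the threshold, as in this toy example, PCA removes few dimensions and offers little benefit.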

4.3.3 Finding a good number of clusters

In general, the higher the number of clusters, the more similar the points within one cluster will be. However, the model also becomes messier and harder to interpret. Using the elbow criterion, the hyperparameter k (the number of clusters) is therefore kept as small as is sensible so that the result stays interpretable.
We test the total within-cluster sum of squares for two to ten clusters.

Figure 8: Optimal k-value for the K-means algorithm when applying the elbow criterion to the within sum of squares plot, with the cut-off value of k = 2 marked.

Using the elbow criterion, it can be seen that after k = 2 the decrease in the within-cluster sum of squares is no longer as strong. Therefore, the final K-means model is trained with two clusters.
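The quantity behind the elbow plot can be sketched as follows. The Python function computes the total within-cluster sum of squares for a given assignment of points to clusters. The toy points and the hand-picked cluster assignments for one, two and three clusters are assumptions for illustration, not the result of fitting K-means to the report's data.

```python
def total_within_ss(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, label in zip(points, labels) if label == cluster]
        center = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum((a - c) ** 2 for p in members for a, c in zip(p, center))
    return total

# Hypothetical scaled data with two visible groups.
points = [(-1.2, -0.9), (-0.8, -1.1), (-1.0, -1.0),
          (0.9, 1.1), (1.1, 0.8), (1.0, 1.1)]

# Hand-picked assignments for k = 1, 2 and 3 clusters.
wss_k1 = total_within_ss(points, [0, 0, 0, 0, 0, 0])
wss_k2 = total_within_ss(points, [0, 0, 0, 1, 1, 1])
wss_k3 = total_within_ss(points, [0, 0, 0, 1, 2, 1])

drop_1_to_2 = wss_k1 - wss_k2
drop_2_to_3 = wss_k2 - wss_k3
print(round(drop_1_to_2, 3), round(drop_2_to_3, 3))
```

In this toy example the drop from one to two clusters is far larger than the drop from two to three, which is the kind of bend the elbow criterion looks for.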

4.3.4 Resulting plots

Figure 9: Matrix of scatterplots coloured by cluster.

When looking at the scatterplots, the first thing to note is that the happiness score shows the cleanest split between the clusters: one cluster mostly contains points with a scaled happiness score above 0 and the other points below 0. In other words, the clusters are strongly informed by whether the happiness score lies above or below the median.

Median of the happiness score: -0.0004473365

For further context, the median of the scaled happiness score is almost exactly zero. The initial plots have shown that the happiness score correlates with all variables plotted here except for generosity. All plots in the matrix show a similar pattern of a red cluster on the right and a black cluster on the left. The stronger the correlation between two variables, the less the clusters appear to overlap. For GDP per capita and healthy life expectancy, for example, the scatterplot shows a positive linear relationship and very little overlap, whereas for generosity and freedom to make life choices most of the clusters overlap and no strong correlation exists.

4.4 Using Random Forest to rank feature importance

When training a random forest, many trees are fitted, each on a bootstrap sample of the data, with a random subset of predictors considered at each split. The resulting plot shows two different measures of variable importance, that is, of how much a variable contributes to improving the model.

Figure 10: Variable importance when predicting the happiness score with a random forest model.

The left plot shows how much a variable contributes to decreasing the mean squared error (MSE) of the resulting model. The MSE is the average squared difference between the true values and the predictions. The variables are ordered by how much the MSE increases when the information in a variable is removed by randomly permuting its values. In this example, GDP per capita lowers the MSE the most; conversely, the MSE is a lot higher when its information is missing.
Node purity, on the other hand, represents how well the trees split the data into groups with similar target values. The more a variable reduces the residual sum of squares across the splits in which it is used, the more important it is. In this plot GDP per capita again ranks highest, which means that it provides a useful condition on which to split groups.
When predicting the happiness score, GDP per capita and healthy life expectancy rank highest on both importance measures. This agrees with the high correlations measured between the happiness score and these two variables.
An important thing to keep in mind is that there is an element of randomness in these results. For different seeds, the top four variables are stable, but their importance values and their ranking can differ. It should also be noted that freedom to make life choices, for example, with an increase in MSE of 20%, still has a strong impact, even if it ranks low compared to the other variables.
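The idea behind the MSE-based importance can be sketched as permutation importance: shuffle the values of one variable, predict again, and record how much the MSE increases. The Python sketch below is a simplified illustration, not the computation performed by the randomForest package. The `model` function is a hypothetical stand-in for a fitted forest, the data are invented, and the shuffling is applied to the full data rather than to the out-of-bag samples of each tree.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(model, rows, y_true, feature, repeats=10):
    """Average increase in MSE after shuffling one feature's column."""
    baseline = mse(y_true, [model(row) for row in rows])
    increases = []
    for seed in range(repeats):
        column = [row[feature] for row in rows]
        random.Random(seed).shuffle(column)
        permuted = [{**row, feature: value}
                    for row, value in zip(rows, column)]
        increases.append(mse(y_true, [model(r) for r in permuted]) - baseline)
    return sum(increases) / repeats

# Hypothetical stand-in for a fitted model: it uses gdp and ignores generosity.
def model(row):
    return 3.0 + 2.0 * row["gdp"]

rows = [{"gdp": g, "generosity": s}
        for g, s in [(0.5, 0.3), (0.9, 0.1), (1.2, 0.4), (1.5, 0.2), (1.8, 0.0)]]
y_true = [model(row) for row in rows]

gdp_importance = permutation_importance(model, rows, y_true, "gdp")
generosity_importance = permutation_importance(model, rows, y_true, "generosity")
print(gdp_importance > generosity_importance)
```

A variable the model relies on produces a clear increase in MSE when shuffled, while a variable the model ignores produces none. Because the shuffling is random, the exact values depend on the seeds, which mirrors the seed dependence noted above.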

5 References

  1. World Happiness Report. About. Retrieved July 26, 2024, from https://worldhappiness.report/about/
  2. World Happiness Report. World Happiness Report Appendices & Data. Retrieved July 26, 2024, from https://worldhappiness.report/data/
  3. Wikipedia. (2024, July). Gallup, Inc. Retrieved July 27, 2024, from https://en.wikipedia.org/wiki/Gallup,_Inc.#Gallup_World_Poll
  4. World Happiness Report. (2024). World Happiness Report 2024 Figure 2.1: Country Rankings by Life Evaluations in 2021-2023. Retrieved July 27, 2024, from https://public.tableau.com/app/profile/worldhappiness/viz/2024Draft/Figure2_1
  5. World Happiness Report. FAQ. Retrieved July 27, 2024, from https://worldhappiness.report/faq/
  6. World Happiness Report. (2024, March 12). Appendix 1: Statistical Appendix for Chapter 2 of World Happiness Report 2024. Retrieved July 26, 2024, from https://happiness-report.s3.amazonaws.com/2024/Ch2+Appendix.pdf
  7. Islam, S. (2023). World Happiness Report up to 2023. Retrieved July 1, 2024, from https://www.kaggle.com/datasets/sazidthe1/global-happiness-scores-and-factors